First things first: we need to set up some variables. We'll define headers for our games, and home_grounds so we know which ground counts as home for each team.

Now we need to find the CSVs, which is pretty easy. We'll do a recursive glob over our repo folder to get all the CSV paths into a list.
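As a rough sketch of the recursive glob, something like the following would work. The function name `find_csv_files` and the `repo_dir` argument are placeholders for illustration, not names taken from the notebook.

```python
import glob
import os


def find_csv_files(repo_dir):
    """Return every .csv path under repo_dir, searching subfolders too."""
    # "**" only matches nested directories when recursive=True is passed.
    pattern = os.path.join(repo_dir, "**", "*.csv")
    return sorted(glob.glob(pattern, recursive=True))
```

Sorting the result just keeps the file order stable between runs, which makes debugging a little easier.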

Of course, now we need to actually load in the CSV files. Since the files have an info section and a balls section, we need to check which line we're working with and act accordingly. We'll store all the info as a dictionary for later access, and convert each ball into its own dictionary object and store it into a list of balls.
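A minimal sketch of that parsing step is below. It assumes each row starts with a tag such as `info` or `ball`, that info rows look like `info,key,value`, and that ball rows carry the columns listed in `BALL_HEADERS`. Those column names are our guess for illustration and may not match the notebook's actual headers.

```python
import csv

# Assumed column names for a "ball" row; the real headers may differ.
BALL_HEADERS = ["innings", "over", "batting_team", "striker", "non_striker",
                "bowler", "runs_off_bat", "extras", "wicket_kind", "player_out"]


def parse_match(lines):
    """Split one match file into an info dict and a list of ball dicts."""
    info, balls = {}, []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        kind, fields = row[0], row[1:]
        if kind == "info" and len(fields) >= 2:
            # Keys such as "team" can appear twice, so collect values in a list.
            info.setdefault(fields[0], []).append(fields[1])
        elif kind == "ball":
            balls.append(dict(zip(BALL_HEADERS, fields)))
    return info, balls
```

Because `csv.reader` accepts any iterable of strings, you can pass it an open file handle directly, e.g. `parse_match(open(path, newline=""))`. All values come back as strings, so runs need an `int()` before any arithmetic.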

We'll also set a variable for determining what the latest season is, so that we have a test dataset.

Here we set up some empty lists so that we can do some iterative math. We want to know how many games each team played away and at home, how many runs it scored and conceded, and how many wickets it took and lost. We'll use these later to calculate the strength of each team, which informs the expected value.

Now we'll actually iterate over all the games. We'll ignore the latest season, since that's the test set. We'll store sums of runs and wickets into their respective lists, and increment a counter each time a team plays at home or away.
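Here's a simplified sketch of that loop for runs. It swaps the parallel lists for a nested `defaultdict`, and it assumes each game has already been boiled down to a dictionary with `season`, `home`, `away`, `home_runs` and `away_runs` keys. That record shape is hypothetical.

```python
from collections import defaultdict


def accumulate(games, latest_season):
    """Sum runs per team, split by home/away, skipping the held-out season."""
    totals = defaultdict(lambda: defaultdict(int))
    for g in games:
        if g["season"] == latest_season:
            continue  # the latest season is the test set
        home, away = g["home"], g["away"]
        totals[home]["home_games"] += 1
        totals[home]["home_runs_for"] += g["home_runs"]
        totals[home]["home_runs_against"] += g["away_runs"]
        totals[away]["away_games"] += 1
        totals[away]["away_runs_for"] += g["away_runs"]
        totals[away]["away_runs_against"] += g["home_runs"]
    return totals
```

If seasons are stored as strings like "2018/19", `max(g["season"] for g in games)` is one way to pick the latest one, since those strings sort chronologically. Wickets would follow the same pattern with their own keys.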

Here we are going to calculate a bunch of means across all our lists. We'll need these averages when we calculate the strength ratios of the teams later on.

Just to be sure, let's visually inspect and ensure that every team has played at home at least once.

Here we are going to set up some functions for calculating the strengths and weaknesses of each team. We predict the expected score of each team by taking its average score per game as a ratio of the league average. We do this for both home and away fielding and batting, and then cross-tabulate them later to get our predicted scores. We also set up a neutral scores function, in case both teams are playing away (which happens frequently).
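To make the ratio idea concrete, here is one common way to write it, shown for the home side only. This is a sketch of the general attack-times-defence approach rather than the notebook's exact formula, and the dictionary keys (`home_runs_for`, `away_runs_against`, and so on) are assumed names.

```python
def strength(team_total, team_games, league_avg):
    """A team's per-game average expressed as a ratio of the league average."""
    return (team_total / team_games) / league_avg


def expected_home_runs(home, away, league_home_avg):
    """Expected runs for the home side: attack x opposing defence x baseline."""
    attack = strength(home["home_runs_for"], home["home_games"], league_home_avg)
    # The away side's defence: runs it concedes away, relative to what
    # home sides score on average across the league.
    defence = strength(away["away_runs_against"], away["away_games"], league_home_avg)
    return attack * defence * league_home_avg
```

A ratio above 1 means the team bats (or leaks runs) at a higher rate than the league average, and below 1 means a lower rate. The away and neutral cases would mirror this with their own averages.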

We repeat what we did above, but for wickets.

Time to calculate the Poisson distribution. This will tell us the probability of each possible score, assuming that the runs off each ball are independent of every other ball in a given innings. Note that it's really important to make sure we also account for neutral ground here, since that will inform our prediction.
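For reference, the Poisson probability of seeing exactly k runs when the expected score is lam can be computed with nothing but the standard library. This is a hand-rolled sketch rather than the library call used in the notebook, and it works in log space so that the factorial doesn't overflow at cricket-sized scores.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam (lam must be > 0)."""
    # lgamma(k + 1) == log(k!), which avoids overflowing on large k.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
```

For example, `[poisson_pmf(k, 140.0) for k in range(50, 251)]` gives the probability of every score from 50 to 250 for a team expected to make 140.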

Also, let's create a tabular view so we can see what the most probable values are!

Let's repeat the process, but swap the home ground and see what happens!

We notice there are differences in home vs away, which is what we were looking for.

The next step is to create a probability matrix. This is achieved by multiplying the two Poisson distributions together. Since there is a theoretical max score of 720 per innings, though, we'd expect each individual probability to be quite small (e.g. 148 to 122 should produce a small number!).
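A minimal sketch of that multiplication is below, using plain lists instead of an array library. The function name `score_matrix` and the expected scores passed to it are illustrative, and treating the two innings as independent is the assumption that lets us simply multiply.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def score_matrix(lam_a, lam_b, lo=50, hi=250):
    """matrix[i][j] = P(team A scores lo + i) * P(team B scores lo + j)."""
    pa = [poisson_pmf(k, lam_a) for k in range(lo, hi + 1)]
    pb = [poisson_pmf(k, lam_b) for k in range(lo, hi + 1)]
    # Outer product of the two score distributions.
    return [[x * y for y in pb] for x in pa]
```

With expected scores of 148 and 122, the cell for exactly 148 to 122 comes out at roughly 0.001, which shows just how small these probabilities get even for the likeliest outcome.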

We made the collective decision not to calculate for every possible run score, and so the matrix bottoms out at 50 and caps out at 250.

Also note that we can only use the historical scores of each team, and we are limited by how young the WBBL is (as of this writing the WBBL has only hosted 210 matches, whereas the IPL has hosted 765). Once each team has played hundreds of matches we'll be able to track its performance much better (and end up with a much bigger matrix!).

Let's see if we can't figure out what our most probable run score is for this game.
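One way to pull the most probable score out of a matrix like this is to search for its largest cell. The sketch below assumes the matrix is a list of lists whose first row and column correspond to a score of `lo`. The function name is ours.

```python
def most_probable_score(matrix, lo=50):
    """Return ((runs_a, runs_b), probability) for the largest cell."""
    i, j = max(
        ((r, c) for r in range(len(matrix)) for c in range(len(matrix[r]))),
        key=lambda rc: matrix[rc[0]][rc[1]],
    )
    # Shift the indices back into actual run totals.
    return (lo + i, lo + j), matrix[i][j]
```

If several cells tie for the maximum, `max` returns the first one it encounters, scanning row by row.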

As seen above, our suspicions were correct - matchups between teams are pretty even, and the odds of the score being a blowout are vanishingly small (at least in terms of the total runs scored, not accounting for technical wins e.g. wickets).

Our next step is to make a table of the offence and defence of each team, just to check that our math isn't way off.

Really interestingly, the home field advantage seems to be all over the place. We suspect this is because some teams get to play disproportionately more games on their home fields than other teams.

This could also be due to the nature of the sport: in cricket the fielders heavily outnumber the two batters at the crease.

We'll do the same table for wickets.

Now it's time to actually calculate the predicted scores. We'll iterate over the last season of games and run a different calculation based on our assessment of which teams are playing at home. We'll append those to a final list that we'll use to verify against Pearson standardised residuals and density plots.
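The home/away/neutral decision could be sketched like this. It assumes `home_grounds` maps each team name to a collection of venue names, which may not be how the notebook stores it, and the team and venue names in the usage note are placeholders.

```python
def venue_status(team_a, team_b, venue, home_grounds):
    """Classify a fixture as 'a_home', 'b_home', or 'neutral'."""
    if venue in home_grounds.get(team_a, ()):
        return "a_home"
    if venue in home_grounds.get(team_b, ()):
        return "b_home"
    # Neither side lists this venue, so both teams are effectively away.
    return "neutral"
```

For instance, with `home_grounds = {"Team A": {"Ground X"}, "Team B": {"Ground Y"}}`, a match between the two at "Ground Z" is classed as neutral, and the neutral scores function would be used for both sides.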

Now it is time to calculate the standardised residuals. Statsmodels makes this pretty easy. We'll join all the observed values into one list and place them in a contingency table with the expected values.
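To show what a Pearson residual is measuring, here is a hand-rolled version. Note that this is not the statsmodels contingency-table route described above. It is the plain per-observation formula, which relies on the fact that a Poisson variable's variance equals its mean.

```python
import math


def pearson_residuals(observed, expected):
    """(O - E) / sqrt(E) for each pair; Var(X) = E under a Poisson model."""
    return [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]
```

A residual near 0 means the prediction was close. Values well beyond about 2 in either direction suggest the model missed that game by more than Poisson noise alone would explain.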

Now it's time to plot. We'll start with runs, and look at residuals vs fitted values (i.e. regression).

Next let's see if there's an observable trend with the raw observations against the standardised residuals.

Lastly, let's do a density plot to see if there's a skew towards over or underestimation.

Now let's repeat for wickets.